We have now discussed the design details of many large-scale systems. In each design problem, our goal was not only to meet the functional requirements but also to ensure that non-functional requirements like scalability, availability, and low latency were met. Even though we designed these systems carefully, system (API) failures are inevitable because many factors can adversely affect the operation of an API, resulting in service disruption or outage.

In this chapter, we will learn about API failures, their impact on businesses, and what causes them. We will also look at the details of some real-world examples of API failures and discuss their corresponding mitigation techniques.

What is API failure?

APIs can enhance user experience and benefit companies if they’re working as intended. However, APIs can underperform or may experience a service disruption. An API failure refers to the malfunctioning (returning unexpected errors or being completely inaccessible) of API-based services. API failure is not limited to direct request and response failures. Since APIs are intended to be used as a means of communication between clients and servers, they can also serve as a means for service abuse, security issues, and other malicious activities that may lead to system failures. Generally, any failure caused or triggered by an API can be considered an API failure.


Note: The API itself may not be the problem. API failures often occur due to weak security mechanisms, poor implementation, or other flaws in the systems connected by the API.

What is the impact of API failures?

Consumer applications rely heavily on APIs. API calls are usually the only point of interaction between a client and server. A service interruption or unavailability caused by API failure can have the following effects:

  • Service disruption: An API outage may render the associated application unusable.

  • Data loss: Corruption, loss, or unauthorized access to critical or sensitive information can have a devastating impact on business operations.

  • Consumer frustration: Unpredictable and inconsistent behavior of APIs can lead to a loss of trust and loyalty from customers, ultimately leading customers to seek alternatives.

  • Financial loss: Loss of revenue, regulatory penalties, compensation owed under contractual obligations (SLAs), etc., can result in significant commercial and financial loss.


Most significantly, API failures may damage a company's reputation, causing a negative impression of the business. Therefore, in addition to regularly testing and auditing APIs to ensure they’re functioning as expected, it’s also important to understand the reasons for API failures.

Point to Ponder

Question

How does an API failure lead to data loss?

Answer

API failures can lead to data loss in several ways. Some of them are as follows:

  • Data can be lost if the API fails to store or update data and does not handle the resulting errors or exceptions.

  • Vulnerable APIs can also allow malicious code execution, letting hackers destroy or exfiltrate data without permission.

  • APIs that update database records can cause data inconsistency or corruption if a database change is rolled back after the API has already returned a success response, leaving the client unaware that the write did not persist.
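The first and third points above can be illustrated with a minimal sketch in which the handler reports success only after the write has been committed, and reports an error when the write fails. The `accounts` table, the `update_balance` handler, and the status codes are hypothetical names chosen for illustration; the sketch uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3


def make_demo_db():
    """Create a throwaway in-memory database (illustrative schema only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")
    conn.commit()
    return conn


def update_balance(conn, account_id, new_balance):
    """Hypothetical handler: returns (status_code, body)."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if it raises, so no partial write is kept.
        with conn:
            cur = conn.execute(
                "UPDATE accounts SET balance = ? WHERE id = ?",
                (new_balance, account_id),
            )
            updated = cur.rowcount
    except sqlite3.Error as exc:
        # The failure is surfaced to the client instead of being swallowed.
        return 500, {"error": f"update failed: {exc}"}
    if updated == 0:
        return 404, {"error": "account not found"}
    # Success is reported only after the commit has happened.
    return 200, {"id": account_id, "balance": new_balance}
```

Calling `update_balance(conn, 1, -5)` on the demo database violates the `CHECK` constraint, so the client receives an error and the stored balance is left unchanged rather than silently lost or corrupted.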

Reasons for API failures

An API can fail for a number of reasons. We have grouped common causes into the following broad categories.

Infrastructure issues

Communication between the client and service is only possible if both communicating parties are online. Infrastructure issues can lead to disrupted communication and even service breakdown. We can further divide infrastructure problems into the following two types:

  • Network failures: APIs can fail due to communication failure between the client and service caused by network congestion (unexpected spikes in network traffic, DDoS attacks, etc.), network equipment failure, severed communication links, or outages caused by power failures and natural disasters (fires, storms, earthquakes, etc.).

  • Backend failures: Backend components (API gateways, servers, etc.) can also fail due to equipment failures, overloads, scalability issues, etc., resulting in limited or no connectivity. Although requests can often be redirected to alternative (standby) services to avoid such failures, if these redirects are not seamless or take longer than expected, they may cause the request to time out, making the service unavailable to consumers.
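To make the failover timing problem concrete, here is a minimal sketch of a client that tries backends in order but gives up once an overall time budget is spent. The names `call_with_failover`, `AllBackendsFailed`, and the `send(backend, timeout)` callback are assumptions made for illustration and are not part of any particular library.

```python
import time


class AllBackendsFailed(Exception):
    """Raised when no backend answered within the time budget."""


def call_with_failover(backends, send, budget_seconds=2.0, clock=time.monotonic):
    """Try each backend in order until one succeeds or time runs out.

    `backends` is a list of hypothetical backend names, and
    `send(backend, timeout)` is a caller-supplied function that returns
    a response or raises an exception on failure.
    """
    deadline = clock() + budget_seconds
    errors = []
    for backend in backends:
        remaining = deadline - clock()
        if remaining <= 0:
            # The redirect took too long: from the consumer's point of
            # view the service is unavailable even if a standby exists.
            errors.append(f"time budget exhausted before trying {backend}")
            break
        try:
            return send(backend, timeout=remaining)
        except Exception as exc:  # real code would catch specific errors
            errors.append(f"{backend}: {exc}")
    raise AllBackendsFailed("; ".join(errors))
```

Passing the remaining budget as the per-call timeout is one possible design choice; it keeps a slow primary from consuming more time than the caller can tolerate, at the cost of giving later backends less time to respond.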

API failure due to infrastructure issues

Security issues

APIs can fail due to security flaws that allow hackers to take control of the system. Such flaws are a common root cause of API failure. The following are the most common vulnerabilities that can lead to API failure:

  • Weak access control: API services that require only basic or no authentication, or that let users access API endpoints without proper authorization checks (such as broken object level authorization, or BOLA), can lead to serious security threats.

  • Excessive data exposure: APIs that send more data in responses than the user requested or needs can expose sensitive information and make the API vulnerable.

  • Missing data validation checks: When developing an API, we cannot trust that there won’t be hackers among our users. It’s necessary to validate incoming data on the server side because it’s easy for a hacker to bypass client-side validation and inject malicious data (SQL injection, XSS, remote code execution) in the input fields while sending routine requests to access API functionality.

  • Deferring security fixes: Postponing security updates, especially when those updates have a high severity rating, can have devastating consequences. Once a security issue is discovered, hackers start scouring the web for vulnerable systems. Therefore, we should apply security patches as soon as possible after they are released.
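The first three vulnerabilities above can be sketched together in one hypothetical endpoint. In this minimal example, the `orders` table, the `get_order` function, and the `PUBLIC_FIELDS` allowlist are all illustrative assumptions; the point is only to show server-side validation, a parameterized query, an object-level ownership check, and a response limited to an allowlist of fields.

```python
import sqlite3

# Only these fields ever leave the server, whatever else the table holds.
PUBLIC_FIELDS = ("id", "item", "status")
COLUMNS = ("id", "owner_id", "item", "status", "card_number")


class Forbidden(Exception):
    """The caller is authenticated but does not own the object."""


class NotFound(Exception):
    """No object with the requested ID exists."""


def make_demo_orders_db():
    """Create a throwaway in-memory database (illustrative schema only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, owner_id INTEGER, "
        "item TEXT, status TEXT, card_number TEXT)"
    )
    conn.execute(
        "INSERT INTO orders VALUES (1, 10, 'book', 'shipped', '4111-0000')"
    )
    conn.commit()
    return conn


def get_order(conn, requesting_user_id, order_id):
    # Server-side validation: never rely on checks done in the client.
    if isinstance(order_id, bool) or not isinstance(order_id, int) or order_id <= 0:
        raise ValueError("order_id must be a positive integer")
    # Parameterized query: the value is bound by the driver, so it
    # cannot change the structure of the SQL statement.
    row = conn.execute(
        "SELECT id, owner_id, item, status, card_number "
        "FROM orders WHERE id = ?",
        (order_id,),
    ).fetchone()
    if row is None:
        raise NotFound(order_id)
    record = dict(zip(COLUMNS, row))
    # Object-level authorization: the caller must own this specific order.
    if record["owner_id"] != requesting_user_id:
        raise Forbidden(order_id)
    # Return only allowlisted fields to avoid excessive data exposure.
    return {field: record[field] for field in PUBLIC_FIELDS}
```

With the demo data, user 10 receives the order without the `card_number` column, while any other user ID is rejected with `Forbidden` even though the order exists.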

Hackers exploiting vulnerable systems to steal sensitive information

The security concerns above are not an exhaustive list of vulnerabilities. Many other factors, such as CORS policies, data encryption, DDoS attacks, identity theft, etc., can also cause an API to fail.

Development bugs

Development bugs can also lead to API failure. Two common types of issues are listed below:

  • Frequently occurring changes: Overly frequent code changes can lead to poor development or flawed design decisions. It's best to first decide on the feature set, then update the blueprint, implement the code, verify the implementation with tests, etc. An entire cycle must be completed before subsequent changes are implemented. If changes are too frequent, managing complete cycles between teams can be difficult, leading to development errors and issues. Additionally, code documentation that is updated too frequently can confuse users and reduce their productivity.

  • Insufficient testing: Thoroughly testing the system is as important as building the functionality in the first place. We must test every endpoint of our API, including its error paths; otherwise, undetected bugs can lead to system crashes, communication breakdowns, and data losses.
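As a small illustration of testing every endpoint, the sketch below checks a hypothetical request handler against a table of cases that covers each route, including invalid input and unknown paths. The `handle` function, its routes, and the `CASES` table are invented for this example; a real project would typically express the same idea with its test framework of choice.

```python
def handle(method, path, body=None):
    """Hypothetical request handler returning (status_code, payload)."""
    if path == "/health":
        if method == "GET":
            return 200, {"status": "ok"}
        return 405, {"error": "method not allowed"}
    if path == "/users":
        if method == "POST":
            if not isinstance(body, dict) or not body.get("name"):
                return 400, {"error": "name is required"}
            return 201, {"name": body["name"]}
        return 405, {"error": "method not allowed"}
    return 404, {"error": "not found"}


# One row per behavior we expect: (method, path, body, expected status).
CASES = [
    ("GET", "/health", None, 200),
    ("DELETE", "/health", None, 405),
    ("POST", "/users", {"name": "Ada"}, 201),
    ("POST", "/users", {}, 400),
    ("POST", "/users", None, 400),
    ("GET", "/users", None, 405),
    ("GET", "/missing", None, 404),
]


def run_cases(cases=CASES):
    """Return the cases whose actual status differs from the expected one."""
    failures = []
    for method, path, body, expected in cases:
        status, _ = handle(method, path, body)
        if status != expected:
            failures.append((method, path, expected, status))
    return failures
```

An empty list from `run_cases()` means every listed case behaved as expected; adding a row for each new endpoint or error path keeps the table in step with the API.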

API failure due to undetected development bugs

Fundamental design flaws

Our API is doomed to fail in the long run if its design ignores the most common and fundamental techniques, especially those needed to meet its non-functional requirements (availability, scalability, low latency, etc.). In this course, we have extensively discussed the advantages and use cases of these techniques. Let's briefly describe the most common ones as a refresher:

  • Load balancing: Balancing and distributing incoming requests to the available servers is one of the basic techniques to improve the scalability of the system.

  • Caching: Caching data with proper cache control policy at different levels (client, ISP, CDNs, etc.) is one of the primary ways to reduce latency and improve the performance of the system.

  • Limiting requests: Applying caps to incoming requests based on different policies (gateway rate limiting, request quotas, per-resource access, etc.) improves the availability of the system.

  • Circuit breakers: With circuit breakers and mechanisms to check server health, we can avoid cascading failures and ensure the reliability of our services.

  • Monitoring systems: With an API monitoring and alerting dashboard, we can diagnose service-related issues and vulnerabilities and take proactive steps to ensure the reliability and security of our API.

This is a non-exhaustive list. Companies may also use other techniques to become reliable service providers.
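As a refresher on request limiting, the following is a minimal sketch of a token bucket, one common way to cap incoming requests. The class name `TokenBucket` and its parameters are assumptions for illustration, and the clock is injectable only so the behavior can be checked without waiting in real time.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last_refill = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is capped."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A gateway could keep one such bucket per client or per API key and reject a request (for example, with HTTP status 429) whenever `allow()` returns `False`.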

Third-party dependencies

Most modern systems work collaboratively rather than independently, and APIs are one of the main enablers of this collaboration across disparate systems. A failure can occur when one of an API's dependencies stops performing its function, causing the whole system to fail. Let's take C3 APIs as an example. They rely on Amazon to perform their functionality. If, for some reason, Amazon is unavailable, then the C3 APIs will suffer disruption due to their dependence on Amazon services.
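One way to keep a failing dependency from dragging down the whole system is the circuit breaker technique mentioned earlier in this lesson. The sketch below is a simplified illustration, not a production implementation: the `CircuitBreaker` class, its thresholds, and the `CircuitOpenError` exception are hypothetical names, and the wrapped function stands in for any call to a third-party service.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls to the dependency are temporarily suspended."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("dependency calls are suspended")
            # The timeout has elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open, the circuit
            raise
        # A successful call closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, the caller can catch `CircuitOpenError` and serve a fallback, such as cached data or a degraded response, rather than passing the dependency's outage on to its own clients.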

API failure due to third-party dependency

In this lesson, we learned that there are many factors that can make an API fail and many consequences of different kinds of API failures. In the next lesson, we’ll see examples of well-known API failures. After that, we’ll look at some mitigation strategies to avoid such failures in subsequent lessons.
